These videos reference the following articles:
Los Angeles Times 2008 article “Study finds hospitals slow to defibrillate”
Gallup 2018 article “Americans Hit the Brakes on Self-Driving Cars”
In order to generalize results from a sample to a larger population, the sample must be chosen in a way that is representative of the population.
Sampling bias occurs when certain individuals or groups are more likely to be included in a study than others.
-Ex: sample only engineering students about self-driving cars
Voluntary response and non-response bias occur when only a small percentage of people selected in a sample respond. Those who respond might be systematically different than those who do not.
-Ex: respondents might have stronger opinions (or more time on their hands) than others
Researchers should randomly select participants and follow up using multiple methods to reach as many individuals as possible
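In R, a simple random sample can be drawn from a list of population members with `sample()`. The sampling frame of 5,000 ID numbers below is made up for illustration:

```r
set.seed(1234)                 # make the random draw reproducible
frame <- 1:5000                # hypothetical ID numbers for everyone in the population
participants <- sample(frame, size = 100)   # simple random sample of 100 people
head(participants)
```

Every individual in the frame has the same chance of selection, which is what justifies generalizing from the sample to the population.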
A confounding variable is a variable related to both the explanatory and response variable, so that its effects cannot be separated from the effects of the explanatory variable.
An observational study is a study in which researchers observe individuals and measure variables of interest but do not intervene in order to attempt to influence responses
An experiment is a study in which experimental units are randomly assigned to two or more treatment conditions and the explanatory variable is actively imposed on the subjects
In an observational study, we can never conclude that one variable causes a change in the other due to the possibility of confounding variables
In an experiment, we may conclude that one variable causes a change in the other since we have controlled for confounding variables
Control: Researchers assign subjects to different treatments to control for differences between groups.
Randomization Subjects are randomly assigned to groups so that there are no systematic differences between groups, which could introduce confounding factors.
Replication The more subjects are studied, the more precisely we can estimate effects being studied.
Blocking Some experiments group patients with similar characteristics, such as good health or poor health, before assigning treatments. This ensures that each treatment group has the same number of good-health and poor-health patients.
A placebo is a fake treatment given to account for the possibility of subjects experiencing an effect simply from believing they received a treatment.
A double blind experiment is one in which neither the subjects nor the people administering the treatment know whether the subject received a treatment or placebo
We can only generalize from sample to population when a sample is randomly selected.
We can only infer causation when using a randomized experiment.
| | Treatments Randomly Assigned | Treatments Not Randomly Assigned |
|---|---|---|
| Sample Randomly Collected | Can infer causation and generalize to population | Can generalize results, but not infer causation |
| Sample Not Randomly Collected | Can infer causation, but not generalize | Cannot infer causation or generalize results |
We will look at a dataset with information on 272 movies released in 2018, which was obtained from https://www.imdb.com/.
We have information on each film’s title, IMDB score, rating, genre, runtime (in minutes), and revenue (in millions of dollars).
Movies that generated more than $250 million in revenue:
## Title IMDB Rating Genre Runtime Revenue
## 1 Black Panther 7.3 PG-13 Action 134 700.06
## 2 Avengers: Infinity War 8.5 PG-13 Action 149 678.82
## 3 Incredibles 2 7.7 PG Animation 118 608.58
## 4 Jurassic World: Fallen Kingdom 6.2 PG-13 Action 128 417.72
## 5 Aquaman 7.2 PG-13 Action 143 334.14
## 6 Deadpool 2 7.8 R Comedy 119 324.59
## 7 The Grinch 6.3 PG Family 86 270.60
The rows of the dataset are called observational units.
-Films are the observational units in this dataset.
The columns of the dataset are called variables.
A quantitative variable is one that takes on numeric values.
-Examples: IMDB, Runtime, Revenue
A categorical variable is one whose outcomes are a set of categories.
-Examples: Rating, Genre
Bar graphs are used to display frequencies for categorical variables.
Stacked bar graphs display information on 2 categorical variables such as Genre and Rating
Histograms and boxplots are used to display quantitative variables.
In a histogram, the x-axis contains numbers, rather than categories.
Movies with IMDB scores below 4
## Title IMDB Rating Genre Runtime Revenue
## 1 Thugs of Hindostan 3.9 Not Rated Action 164 1.45
## 2 Supercon 3.7 R Comedy 100 3.40
## 3 Show Dogs 3.6 PG Family 92 17.74
## 4 Holmes & Watson 3.5 PG-13 Comedy 90 30.57
## 5 Slender Man 3.2 PG-13 Horror 93 30.57
## 6 Race 3 2.1 Not Rated Thriller 160 1.69
Scatterplots display the relationship between two quantitative variables.
median(movies$IMDB)
## [1] 6.5
mean(movies$IMDB)
## [1] 6.408088
In a roughly symmetric dataset, the mean and median are approximately the same.
This distribution is said to be right-skewed. Most movies made less than $50 million, but a few made much more, creating a “tail,” or long “whisker” going to the right.
median(movies$Revenue)
## [1] 6.605
mean(movies$Revenue)
## [1] 41.55228
In a right-skewed distribution, the mean is considerably larger than the median, since a few very large observations pull the mean up considerably, but don’t change the middle number in the dataset. In these situations, the median is usually a better indicator of a “typical” value.
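A small made-up example shows how a single extreme value affects the two measures differently:

```r
revenues <- c(2, 5, 8, 10, 15)      # made-up revenues, in millions of dollars
mean(revenues)                      # 8
median(revenues)                    # 8
skewed <- c(revenues, 700)          # add one blockbuster
mean(skewed)                        # jumps to about 123
median(skewed)                      # barely moves, to 9
```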
In addition to graphics, we can use statistics to describe the amount of variability in a dataset.
Common Measures of Center: mean, median
Common Measures of Variability: interquartile range (IQR), standard deviation
The higher the IQR or standard deviation, the more variability in the data.
| Genre | Mean | Median | IQR | SD |
|---|---|---|---|---|
| Comedy | 6.383721 | 6.5 | 1.15 | 0.9928147 |
| Drama | 6.731429 | 6.8 | 0.70 | 0.8112341 |
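Group summaries like those in the table can be computed with `aggregate()`; the scores below are made up, since the full movies data are not reproduced here:

```r
toy <- data.frame(Genre = rep(c("Comedy", "Drama"), each = 4),
                  IMDB  = c(5.1, 6.4, 6.6, 7.5,    # made-up Comedy scores
                            6.2, 6.7, 6.9, 7.2))   # made-up Drama scores
aggregate(IMDB ~ Genre, data = toy,
          FUN = function(x) c(mean = mean(x), median = median(x),
                              IQR = IQR(x), sd = sd(x)))
```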
Example from Chapter 1 of Introduction to Statistical Investigations by Tintle et al.
Image from http://www.bbc.com
Pioneering research in marine biology (1960s) studied whether dolphins could communicate with each other beyond relaying simple feelings.
Dr. Jarvis Bastian conducted experiment involving two dolphins, Doris and Buzz.
Image from Introduction to Statistical Investigation by Tintle et al.
Two buttons and a light were placed underwater. Dolphins trained to push right button when light shone steadily and left button when light blinked.
After mastering task, dolphins were separated by curtain placed through the center of the pool. Only Doris could see light, and only Buzz could push button, but they could hear each other’s sounds.
After seeing light, Doris would whistle to Buzz, who would press button. If Buzz pushed correct button, both dolphins were rewarded with fish.
Does this result provide evidence that the dolphins were actually communicating effectively?
Hypothesis 1:
Buzz was just guessing which button to push.
Hypothesis 2:
Buzz was not just guessing, and was using information from Doris (or
possibly another source).
Does this result provide evidence that the dolphins were actually communicating effectively?
Hypothesis 1: (Null Hypothesis)
Buzz was just guessing which button to push. (\(p=0.5\))
Hypothesis 2: (Alternative Hypothesis)
Buzz was not just guessing, and was using information from Doris (or
possibly another source). (\(p>0.5\))
The null hypothesis is the “by chance alone” explanation.
The alternative hypothesis is another explanation that contradicts the null hypothesis.
How likely is it that the dolphins would have gotten 15 or more attempts correct out of 16 if they were just guessing?
In general, we need to determine the probability of getting a result as extreme or more extreme than we did (i.e. 15 out of 16 correct) if the null hypothesis is true (that is, the dolphins are randomly guessing).
We’ll do this by simulating a situation where the null hypothesis is true (a coin flip)
The following R code will simulate flipping a coin 16 times.
set.seed(09192018)
Flips <- sample(c("H", "T"), prob= c(0.5, 0.5), size=16, replace=TRUE)
Flips
## [1] "T" "T" "H" "T" "T" "H" "T" "H" "H" "T" "T" "T" "H" "T" "H" "H"
Number of heads:
sum(Flips == "H")
## [1] 7
Now, we’ll repeat simulating 16 flips 10,000 times, and keep track of the number of heads.
set.seed(09192018)
Heads <- rep(NA, 10000)
for( i in 1:10000){
Flips <- sample(c("H", "T"), prob= c(0.5, 0.5), size=16, replace=TRUE)
Heads[i] <- sum(Flips == "H")
}
Results <- data.frame(Heads)
SimDolphins <- gf_histogram(~Heads, data=Results,
bins=17, binwidth = 1,
border=0, fill="blue", color="black") +
geom_vline(xintercept=15, colour="red")
SimDolphins
In 10,000 simulations how many times did we get 15 or more heads?
sum(Heads >= 15)
## [1] 3
The probability of the dolphins getting 15 or more of the 16 signals correct, if they were just guessing, is approximately \(\frac{3}{10,000}=0.0003\).
There is strong evidence that the dolphins are not just guessing, and may indeed be communicating.
The p-value is the probability of obtaining a result as or more extreme than we did, when the null hypothesis is true.
Our simulated p-value is \(\frac{3}{10,000}=0.0003\).
The probability of the dolphins getting 15 or more attempts correct if they are just randomly guessing is approximately 0.0003.
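As a check on the simulation, the same probability can be computed exactly, since the number of correct guesses out of 16 follows a binomial distribution under the null hypothesis:

```r
# P(15 or more correct out of 16) when guessing with probability 0.5
pbinom(14, size = 16, prob = 0.5, lower.tail = FALSE)   # 17/65536, about 0.00026
```

The exact value is close to the simulated p-value of 0.0003.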
Small p-values provide evidence against the null hypothesis!
Often, we reject the null hypothesis if the p-value is less than 0.05 (0.1 and 0.01 are also reasonable criteria).
Dolphins Example:
Since the p-value is very small, it is very unlikely that the dolphins would have gotten 15 or more correct if they were randomly guessing.
We reject the null hypothesis. There is strong evidence that the dolphins were not purely guessing.
What if Buzz had only been right on 10 of the 16 tries? Would this change our conclusion?
In 10,000 simulations how many times did we get 10 or more heads?
sum(Heads >= 10)
## [1] 2299
The probability of the dolphins getting 10 or more of the 16 signals correct, if they were just guessing, is approximately \(\frac{2,299}{10,000}=0.2299\).
It is plausible that the dolphins would have gotten 10 or more correct by randomly guessing, so in this situation, we would not have evidence that the dolphins are actually communicating. We would not reject the null hypothesis.
We will never accept or prove the null hypothesis. We can only evaluate the strength of the evidence against it.
To review,
In order to make general claims about dolphin communication, this experiment (or similar ones) would need to be conducted on dolphins other than just Buzz and Doris. Subsequent research has provided evidence that dolphins are highly intelligent mammals, capable of high-level communication.
A 2011 study by Wood, titled “Babies Learn Early Who They Can Trust” examined the way that babies responded to adults who either correctly or falsely led them to believe there was something exciting in a box. (Data from Statistics: Unlocking the Power of Data by Lock et al.)
| | Imitated | Did not imitate | Total |
|---|---|---|---|
| Box Contained Toy (adult trustworthy) | 18 (0.60) | 12 (0.40) | 30 |
| No Toy (adult not trustworthy) | 10 (0.33) | 20 (0.67) | 30 |
Do you think these results provide evidence of a difference in the way the babies respond?
Explanation 1:
Whether or not a baby presses the light has nothing to do with the adult’s behavior. More of the babies who were inclined to press the button happened to be assigned to the group with “trustworthy adults,” by pure chance.
Explanation 2:
The observed difference between the groups is due to more than just pure chance. (Perhaps explained by the behavior or “trustworthiness” of the adult.)
Define:
\(p_1\): proportion of all babies who would imitate a “trustworthy” adult.
\(p_2\): proportion of all babies who would imitate a “non-trustworthy” adult.
Null Hypothesis: There is no difference between the proportion of all babies who would imitate “trustworthy” and “non-trustworthy” adults (\(p_1=p_2\) or \(p_1-p_2=0\))
Alternative Hypothesis: Babies are more likely to imitate “trustworthy” adults than non-trustworthy ones. (\(p_1>p_2\) or \(p_1-p_2>0\))
Sample statistics: \(\hat{p}_1=\frac{18}{30}=0.6\), \(\hat{p}_2=\frac{10}{30}\approx0.3333\)
\(\hat{p}_1-\hat{p}_2\approx0.2667\)
How likely is it that we would get a difference in proportions as large as 0.2667 if there is really no difference in the way babies respond to “trustworthy” and “non-trustworthy” adults?
How might we simulate a situation where there is really no difference in the way the babies respond?
## [[1]]
##
## $`Observed Difference in Proportions`
## [1] 0.2666667
##
## $`Simulation-based p-value`
## [1] 0.025
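The output above can be reproduced by shuffling the group labels, simulating a world where the adult’s behavior makes no difference; this permutation sketch is an assumption about how the analysis was done, not the original code:

```r
set.seed(2011)
imitated <- c(rep(1, 18), rep(0, 12),    # "trustworthy" group: 18 of 30 imitated
              rep(1, 10), rep(0, 20))    # "non-trustworthy" group: 10 of 30 imitated
group <- rep(c("trust", "no_trust"), each = 30)
obs_diff <- mean(imitated[group == "trust"]) - mean(imitated[group == "no_trust"])

diffs <- replicate(10000, {
  shuffled <- sample(group)              # randomly re-assign the labels
  mean(imitated[shuffled == "trust"]) - mean(imitated[shuffled == "no_trust"])
})
mean(diffs >= obs_diff)                  # proportion of shuffles at least as extreme
```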
The p-value represents the probability of getting a difference in sample proportions as large as 0.2667 if there is really no difference between the proportion of all babies who would imitate “trustworthy” and “non-trustworthy” adults.
Since our observed difference is extreme and the p-value is low (0.025), our results are not consistent with the hypothesis that there is no difference. The data provide evidence that the difference between the groups is not due to chance alone.
The extent to which we attribute this to the “trustworthiness” of the adult depends on the nature of the experiment. Since “trustworthiness” was established in a very specific way, we should be careful not to overgeneralize.
The strength of evidence against a null hypothesis depends on both:
the difference between the observed statistic \(\hat{p}\) and the hypothesized value \(p\).
the sample size.
Null Hypothesis: For all AP test questions, probability that B is correct is 1/5 = 0.2 (\(p=0.2\))
Alternative Hypothesis: For all AP test questions, probability that B is correct is greater than 0.2 (\(p>0.2\))
As the difference between the hypothesized and true values increases, the evidence against the null hypothesis gets stronger. The p-value gets smaller.
As the sample size increases, the evidence against the null hypothesis gets stronger. The p-value gets smaller.
What do you notice about the shape of the sampling distributions we’ve seen so far?
In many situations, the sampling distribution for a proportion (or a difference in proportions) can be approximated by a symmetric, bell-shaped curve, known as a normal distribution.
Image from Intro Statistics with Simulation and Randomization by Deitz, Barr, Cetinkaya-Rundel
A standardized score, or z-score is useful for comparing our sample statistic \(\hat{p}\) to the hypothesized value \(p\).
A z-score tells us how many standard deviations a statistic lies away from its hypothesized value.
\(z=\frac{\text{Statistic}-\text{Hypothesized Value}}{\text{Standard Deviation of Statistic}}\)
The standard deviation of a sample statistic is sometimes called the standard error.
z-scores more extreme than \(\pm 2\) typically provide evidence against the null hypothesis.
\(z=\frac{\text{Statistic}-\text{Hypothesized Value}}{\text{Standard Error}}\)
\(\text{Standard Error} = \sqrt{\frac{p(1-p)}{n}}\)
Where \(p\) is the hypothesized value, and \(n\) is sample size.
Thus,
\(z=\frac{\hat{p}-p}{\sqrt{\frac{p(1-p)}{n}}}\)
Null Hyp: The proportion of all US adults who favor riding in self-driving cars is 0.25 (\(p=0.25\)).
Alt. Hyp: The proportion of all US adults who favor riding in self-driving cars is less than 0.25 (\(p < 0.25\)).
\(\hat{p}=\frac{758}{3297}=0.23\), and \(n=3297\)
\(z=\frac{0.23-0.25}{\sqrt{\frac{0.25(1-0.25)}{3297}}}=-2.6456\)
The sample statistic we observed is 2.6 standard errors lower than we would have expected if the null hypothesis is true. This provides evidence against the null hypothesis.
p-value:
## [1] 0.003854433
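That p-value can be reproduced from the z-statistic with `pnorm()`, using the exact sample proportion rather than the rounded 0.23:

```r
p_hat <- 758/3297                        # sample proportion, about 0.2299
se <- sqrt(0.25 * (1 - 0.25) / 3297)     # standard error under the null p = 0.25
z <- (p_hat - 0.25) / se                 # about -2.66
pnorm(z)                                 # lower-tail p-value, about 0.0039
```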
The prop.test command in R can be used to get the p-value from the test.
x = number of “successes”
n= sample size
p= hypothesized parameter quantity in null hypothesis
alternative= either “less” , “greater” or “two.sided” depending on
alternative hypothesis
prop.test(x=758, n=3297, p=0.25, alternative="less", conf.level = 0.95, correct=FALSE)
##
## 1-sample proportions test without continuity correction
##
## data: 758 out of 3297, null probability 0.25
## X-squared = 7.0999, df = 1, p-value = 0.003854
## alternative hypothesis: true p is less than 0.25
## 95 percent confidence interval:
## 0.0000000 0.2421781
## sample estimates:
## p
## 0.229906
Note that the z-statistic we calculated is the square root of the X-squared value in the output.
## [[1]]
##
## $`Observed Proportion`
## [1] 0.229906
##
## $`Simulation-based p-value`
## [1] 0.0036
A confidence interval tells us a range in which a parameter could reasonably lie.
An approximate 95% confidence interval is given by
\(\text{Statistic} \pm 2 \times \text{Standard Error}\)
An approximate 95% confidence interval for a proportion \(p\) is:
\(\hat{p} \pm 2 \times \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\)
A 95% confidence interval for the proportion of all US adults who would likely ride in a self-driving car is:
\(0.23 \pm 2\times \sqrt{\frac{0.23(1-0.23)}{3297}} = 0.23 \pm 0.015.\)
\(0.23-0.015=0.215\) and \(0.23+0.015=0.245\)
We can calculate a confidence interval directly using the prop.test() function in R.
Make sure to set alternative=“two.sided” when making a confidence interval.
prop.test(x=758, n=3297, conf.level = 0.95, alternative="two.sided", correct=FALSE)
##
## 1-sample proportions test without continuity correction
##
## data: 758 out of 3297, null probability 0.5
## X-squared = 962.07, df = 1, p-value < 0.00000000000000022
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
## 0.2158625 0.2445781
## sample estimates:
## p
## 0.229906
We are 95% confident that the proportion of all US adults who would want to ride in a self-driving car is between 0.216 and 0.245.
Null Hypothesis: There is no difference between the proportion of all babies who would imitate “trustworthy” and “non-trustworthy” adults (\(p_1=p_2\) or \(p_1-p_2=0\))
Alternative Hypothesis: Babies are more likely to imitate “trustworthy” adults than non-trustworthy ones. (\(p_1>p_2\) or \(p_1-p_2>0\))
Sample statistics: \(\hat{p}_1=\frac{18}{30}=0.6\), \(\hat{p}_2=\frac{10}{30}\approx0.33\)
\(z=\frac{\text{Statistic}-\text{Hypothesized Value}}{\text{Standard Error}}\)
\(z=\frac{(\hat{p}_1-\hat{p}_2)-0}{\sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_1}+\frac{1}{n_2}\right)}}\)
Where \(\hat{p}\) is the overall proportion of successes when the groups are combined, and \(n_1\) and \(n_2\) are the group sample sizes.
Babies example:
\(z=\frac{\frac{18}{30}-\frac{10}{30}}{\sqrt{\left (\frac{18+10}{30+30}\right)\left(1-\left (\frac{18+10}{30+30}\right)\right)\left(\frac{1}{30}+\frac{1}{30}\right)}}\approx2.07\)
The sample statistic we observed is 2.07 standard errors larger than we would have expected if the null hypothesis is true. This provides evidence against the null hypothesis.
p-value:
## [1] 0.01921695
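The z-statistic and p-value shown above can be computed directly from the pooled proportion:

```r
p_pool <- (18 + 10) / (30 + 30)                     # pooled proportion of imitators
se <- sqrt(p_pool * (1 - p_pool) * (1/30 + 1/30))   # standard error under the null
z <- (18/30 - 10/30) / se                           # about 2.07
pnorm(z, lower.tail = FALSE)                        # one-sided p-value, about 0.019
```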
prop.test(x=c(18, 10), n=c(30,30), alternative="greater", correct=FALSE)
##
## 2-sample test for equality of proportions without continuity correction
##
## data: c(18, 10) out of c(30, 30)
## X-squared = 4.2857, df = 1, p-value = 0.01922
## alternative hypothesis: greater
## 95 percent confidence interval:
## 0.06249661 1.00000000
## sample estimates:
## prop 1 prop 2
## 0.6000000 0.3333333
## [[1]]
##
## $`Observed Difference in Proportions`
## [1] 0.2666667
##
## $`Simulation-based p-value`
## [1] 0.025
\(\text{Statistic} \pm 2 \times \text{Standard Error}\)
\((\hat{p}_1-\hat{p}_2)\pm 2\times\sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_1}+\frac{1}{n_2}\right)}\)
An approximate 95% confidence interval for the difference in proportion of babies who would imitate “trustworthy” vs “non-trustworthy” adults is:
\[(\hat{p}_1-\hat{p}_2)\pm 2\times\sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_1}+\frac{1}{n_2}\right)}\]
\[ \begin{aligned} &(\hat{p}_1-\hat{p}_2)\pm 2\times\sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_1}+\frac{1}{n_2}\right)} \\ &=\left(\frac{18}{30}-\frac{10}{30}\right)\pm 2\times\sqrt{\frac{28}{60} \left(1-\frac{28}{60}\right) \left(\frac{1}{30}+\frac{1}{30}\right)}\\ &=0.2667\pm2\times\sqrt{0.0166}\\ &= 0.2667 \pm 2\times0.1288 \\ &= 0.2667 \pm 0.258 \end{aligned} \]
prop.test(x=c(18, 10), n=c(30,30), alternative="two.sided", correct=FALSE)
##
## 2-sample test for equality of proportions without continuity correction
##
## data: c(18, 10) out of c(30, 30)
## X-squared = 4.2857, df = 1, p-value = 0.03843
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## 0.02338304 0.50995029
## sample estimates:
## prop 1 prop 2
## 0.6000000 0.3333333
We are 95% confident that the proportion of babies who would imitate “trustworthy” adults is between 0.02 and 0.51 higher than for “non-trustworthy” adults.
When the sample size is very small, the normal approximation is often inappropriate.
This is especially a concern when the hypothesized value of \(p\) is close to 0 or 1.
Examples:
Guideline: In order to use the normal approximation there should be at least 10 “successes” and 10 “failures” in each group.
A 2004 study by Lange, T., Royals, H. and Connor, L. examined mercury accumulation in largemouth bass, taken from a sample of 53 Florida lakes. If mercury accumulation exceeds 0.5 ppm, then there are environmental concerns. In fact, the legal safety limit in Canada is 0.5 ppm, although it is 1 ppm in the United States.
## # A tibble: 6 × 2
## Lake AvgMercury
## <chr> <dbl>
## 1 Alligator 1.23
## 2 Annie 1.33
## 3 Apopka 0.04
## 4 Blue Cypress 0.44
## 5 Brick 1.2
## 6 Bryant 0.27
The histogram shows the distribution of Mercury levels for the 53 lakes.
| Mean | SD |
|---|---|
| 0.5271698 | 0.3410356 |
Do these data provide enough evidence to say that the mean mercury concentration for all Florida Lakes is higher than 0.5?
How unusual would it be to get a sample of 53 lakes whose mean mercury concentration is 0.027 higher than the overall average?
To answer this, we need to understand the behavior of sample mean for samples of 53 lakes.
Distribution of Hg Concentration in Individual Lakes:
Distribution of Sample Means: (sampling distribution for the mean)
As the sample size increases, the distribution of sample means becomes more symmetric and bell-shaped, and its variability decreases.
It is reasonable to approximate the distribution of sample means with a symmetric, bell-shaped t-distribution when the data are roughly symmetric, or when the sample size is reasonably large.
Distribution of revenues for 2018 movies:
Distribution of Sample Means: (sampling distribution for the mean)
As the sample size increases, the sampling distribution of the mean revenue becomes more bell-shaped and less variable, even though individual revenues are strongly right-skewed.
| Mean | SD |
|---|---|
| 0.5271698 | 0.3410356 |
Do these data provide enough evidence to say that the mean mercury concentration for all Florida Lakes is higher than 0.5?
How unusual would it be to get a sample of 53 lakes whose mean mercury concentration is 0.027 higher than the overall average?
Parameter of interest: mean mercury level for all lakes in Florida (\(\mu\))
Null Hypothesis: The mean mercury level for all lakes in Florida is 0.5 ppm. (\(\mu=0.5\))
Alternative Hypothesis: The mean mercury level for all lakes in Florida exceeds 0.5 ppm. (\(\mu>0.5\))
Sample Statistic: \(\bar{x}=0.527\)
| | Categorical | Quantitative |
|---|---|---|
| Data (outcome variable) | A category | A number |
| Parameter of interest | unknown long-run proportion (\(p\)) | unknown overall mean (\(\mu\)) |
| Sample statistic | prop. from sample (\(\hat{p}\)) | sample mean (\(\bar{x}\)) |
standardized statistic: \(t= \frac{\bar{x}-\mu}{s\mathbin{/}\sqrt{n}}\)
where:
\(\bar{x}\) is the sample mean
\(\mu\) is the value from the null hypothesis
\(s\) is the sample standard deviation
\(n\) is the sample size
The quantity \(\frac{s}{\sqrt{n}}\) is called the standard error of \(\bar{x}\).
A confidence interval for \(\mu\) is given by:
\(\bar{x}\pm m\times \frac{s}{\sqrt{n}}\)
For approximate 95% confidence, use \(m=2\)
Recall the mean and standard deviation in the sample of 53 Florida Lakes.
| Mean | SD | n |
|---|---|---|
| 0.5271698 | 0.3410356 | 53 |
When testing the null hypothesis \(\mu=0.5\), the standardized statistic is:
\(t= \frac{\bar{x}-\mu}{s\mathbin{/}\sqrt{n}}=\frac{0.527-0.5}{0.341\mathbin{/}\sqrt{53}}=0.58\).
The \(t\)-statistic is consistent with what we would expect if the mean mercury level for all Florida lakes was 0.5 ppm. It is plausible that the mean mercury level is 0.5 ppm.
pt(q=0.58, df=52, lower.tail = FALSE)
## [1] 0.2822097
t.test(x=FloridaLakes$AvgMercury, mu=0.5, conf.level=0.95, alternative="greater")
##
## One Sample t-test
##
## data: FloridaLakes$AvgMercury
## t = 0.58, df = 52, p-value = 0.2822
## alternative hypothesis: true mean is greater than 0.5
## 95 percent confidence interval:
## 0.4487193 Inf
## sample estimates:
## mean of x
## 0.5271698
The probability of observing a sample mean as large as 0.527 ppm if the true mean mercury concentration for all Florida lakes was really 0.50 ppm is 0.28 (pretty high).
We do not reject the null hypothesis. There is not enough evidence to conclude that the mean mercury concentration for all Florida Lakes is more than 0.5 ppm.
t.test(x=FloridaLakes$AvgMercury, mu=0.5, conf.level=0.95, alternative="two.sided")
##
## One Sample t-test
##
## data: FloridaLakes$AvgMercury
## t = 0.58, df = 52, p-value = 0.5644
## alternative hypothesis: true mean is not equal to 0.5
## 95 percent confidence interval:
## 0.4331688 0.6211709
## sample estimates:
## mean of x
## 0.5271698
We are 95% confident that the mean mercury concentration in all Florida lakes is between 0.43 and 0.62 ppm.
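The endpoints can be approximated by hand with the ±2 standard error formula; they differ slightly from the t.test() interval, which uses the exact t multiplier rather than 2:

```r
xbar <- 0.5271698; s <- 0.3410356; n <- 53   # sample summaries for the 53 lakes
me <- 2 * s / sqrt(n)                        # approximate 95% margin of error
lower <- xbar - me                           # about 0.433
upper <- xbar + me                           # about 0.621
c(lower, upper)
```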
Is average mercury level different for lakes in Northern Florida than Southern Florida?
from Google Maps
Null Hypothesis: There is no difference in the mean mercury levels between lakes in Northern and Southern Florida (\(\mu_1=\mu_2\))
Alternative Hypothesis: There is a difference in the mean mercury levels between lakes in Northern and Southern Florida.(\(\mu_1\neq\mu_2\))
gf_boxplot(data=FloridaLakes, AvgMercury ~ Location) %>% gf_refine(coord_flip())
| Location | MeanHg | StDevHg | N |
|---|---|---|---|
| N | 0.4245455 | 0.2696652 | 33 |
| S | 0.6965000 | 0.3838760 | 20 |
How unusual would it be to observe a difference in sample means as large as \(0.6965 - 0.4245 = 0.2720\) if mercury concentrations are really the same in the North as in the South?
We can use a t-test since:
1. Data are roughly symmetric
2. Sample sizes are reasonably large (33 and 20)
3. Samples are independent
A test statistic is:
\(t=\frac{\bar{x}_1-\bar{x}_2}{\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}}\)
A confidence interval for \(\mu_1-\mu_2\) is given by:
\((\bar{x}_1-\bar{x}_2) \pm m\times{\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}}\)
For a 95% confidence interval \(m=2\)
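Plugging the summary statistics from the table into the test-statistic formula:

```r
xbar1 <- 0.4245455; s1 <- 0.2696652; n1 <- 33   # northern lakes
xbar2 <- 0.6965000; s2 <- 0.3838760; n2 <- 20   # southern lakes
se <- sqrt(s1^2/n1 + s2^2/n2)
t_stat <- (xbar1 - xbar2) / se                  # about -2.78
t_stat
```

This matches the t-statistic reported by t.test() below.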
t.test(data=FloridaLakes, AvgMercury~Location, conf.level=0.95, alternative="two.sided")
##
## Welch Two Sample t-test
##
## data: AvgMercury by Location
## t = -2.7797, df = 30.447, p-value = 0.009239
## alternative hypothesis: true difference in means between group N and group S is not equal to 0
## 95 percent confidence interval:
## -0.4716369 -0.0722722
## sample estimates:
## mean in group N mean in group S
## 0.4245455 0.6965000
The low p-value of 0.009239 tells us that there is strong evidence against the null hypothesis. We have reason to believe that the average mercury concentration in northern lakes is not the same as in southern lakes.
We are 95% confident that the average mercury concentration in northern lakes is between 0.07 and 0.47 parts per million less than in southern lakes.
Image from Introduction to Statistical Investigations by Tintle et al.
Null Hypothesis: There is no difference between average running times using the wide and narrow angles.
Alternative Hypothesis: There is a difference between average running times using the wide and narrow angles.
How is this question similar to the comparison of mercury levels in lakes in Northern vs Southern Florida? How is it different?
To this point, we have assumed that we are working with independent data.
In the lakes example, we observed a different set of lakes in Northern Florida than in Southern Florida. The samples were independent.
Here, we observed the same runners twice, once using each type of angle. We expect runners who are faster using one angle to also be faster using the other angle, so the samples are not independent.
Do you think the data provide evidence that one running strategy is better than the other? Why or why not?
| Angle | mean | St.Dev | n |
|---|---|---|---|
| narrow | 5.534091 | 0.2597555 | 22 |
| wide | 5.459091 | 0.2728319 | 22 |
Average difference: 5.534-5.459 = 0.075 sec.
We now match the times of each runner individually.
Does this information change your thoughts on whether there is evidence that one strategy is better? Why or why not?
When we have multiple observations on the same subjects, we find the difference between each subject individually, and then perform a 1-sample t-test on the differences.
| Runner | narrow | wide | Difference |
|---|---|---|---|
| 1 | 5.50 | 5.55 | -0.05 |
| 2 | 5.70 | 5.75 | -0.05 |
| 3 | 5.60 | 5.50 | 0.10 |
| 4 | 5.50 | 5.40 | 0.10 |
| 5 | 5.85 | 5.70 | 0.15 |
| 6 | 5.55 | 5.60 | -0.05 |
t.test(x=Baserunners$narrow, y=Baserunners$wide, paired = TRUE)
##
## Paired t-test
##
## data: Baserunners$narrow and Baserunners$wide
## t = 3.9837, df = 21, p-value = 0.0006754
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
## 0.03584814 0.11415186
## sample estimates:
## mean difference
## 0.075
The p-value represents the probability of observing a mean difference as extreme or more extreme than 0.075 if there is really no difference between average times using narrow and wide angles.
Because the p-value is very small (0.0006754), we have strong evidence of differences in running times.
We are 95% confident that it takes between 0.036 and 0.114 seconds longer to run using the narrow approach than the wide approach.
THIS IS NOT THE CORRECT WAY TO ANALYZE THESE DATA!!!
t.test(x=Baserunners$narrow, y=Baserunners$wide)
##
## Welch Two Sample t-test
##
## data: Baserunners$narrow and Baserunners$wide
## t = 0.93383, df = 41.899, p-value = 0.3557
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.08709334 0.23709334
## sample estimates:
## mean of x mean of y
## 5.534091 5.459091
Test for paired differences:
| mean | St.Dev | n |
|---|---|---|
| 0.075 | 0.0883041 | 22 |
\(t=\frac{\bar{x}_d}{\frac{s_d}{\sqrt{n}}} = \frac{0.075}{\frac{0.0883}{\sqrt{22}}}\approx 3.98\)
This is the correct approach.
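The paired arithmetic can be verified from the summary of the differences:

```r
xbar_d <- 0.075; s_d <- 0.0883041; n <- 22   # mean, SD, and count of the differences
t_stat <- xbar_d / (s_d / sqrt(n))           # about 3.98, matching the paired t-test
t_stat
```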
t-Test for Two Independent Samples:
| Angle | mean | St.Dev | n |
|---|---|---|---|
| narrow | 5.534091 | 0.2597555 | 22 |
| wide | 5.459091 | 0.2728319 | 22 |
\(t=\frac{(\bar{x}_1-\bar{x}_2)}{\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}}=\frac{(5.534-5.459)}{\sqrt{\frac{0.2598^2}{22}+\frac{0.2728^2}{22}}}\approx0.93\)
THIS IS NOT THE CORRECT WAY TO ANALYZE THESE DATA!!!
The independent t-test overestimates the amount of variability.
- It mixes in differences between runners, rather than isolating each runner’s difference between the two angles
Paired t-test (Correct)
p-value:
## [1] 0.0006815105
t-Test for Two Independent Samples: (Incorrect)
p-value:
## [1] 0.3576859
Should we use procedures for paired data, or independent data?
We are interested in testing whether a yoga class improves flexibility. 25 people participate in the class. Participants’ flexibility scores are recorded before and after participating in the class.
We are interested in testing whether listening to music impacts concentration. A sample of 80 college students is randomly divided into two groups. Both groups read the same passage from a textbook, but one group reads it while listening to music and the other reads it in silence. Students then take a quiz to see what they remembered and quiz scores are compared.
We are interested in testing whether a new regulation has had an impact on carbon emissions. We collect data on 50 different factories and record their carbon emissions the year before and the year after the regulation was passed.
We are interested in assessing whether mercury contamination levels in lakes differ between the summer and fall. We visit a sample of 53 lakes and measure their mercury levels in July, and then revisit the same lakes in October and measure the mercury levels again.
A school board wants to determine whether there is a difference in average ACT scores between two schools in the district. They take a simple random sample of 100 students who took the ACT from each school, and compare their scores.
More data have been recorded in the last two years than all previous human existence (Forbes magazine)
Data are used to:
With great power comes great responsibility
Example from Introduction to Statistical Investigations by Tintle et al.
All Patients:
| | Survived | Died |
|---|---|---|
| Hospital A | 800 (80%) | 200 (20%) |
| Hospital B | 900 (90%) | 100 (10%) |
We now break down survival rates by the health of patients at the time of admission to the hospital.
Patients in good condition (non-life threatening illness or injury):
| | Survived | Died |
|---|---|---|
| Hospital A | 590 (98.3%) | 10 (1.7%) |
| Hospital B | 870 (96.7%) | 30 (3.3%) |
Patients in poor condition (serious, possibly life-threatening condition):
| | Survived | Died |
|---|---|---|
| Hospital A | 210 (52.5%) | 190 (47.5%) |
| Hospital B | 30 (30%) | 70 (70%) |
Although the overall survival rate is higher at Hospital B, Hospital A has higher survival rates for patients who are in both good and poor health.
Regardless of a patient’s condition, they have a better chance for survival at Hospital A.
The fact that Hospital B has a higher overall survival rate is due to the fact that most of its patients are in good health upon admission, while a high percentage of Hospital A’s patients are admitted in poor health.
The hospital example is an instance of Simpson’s Paradox
Simpson’s paradox occurs when an overall trend appears to “reverse” when data are broken down into subgroups or categories.
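The reversal in the hospital tables can be verified directly from the counts. A small Python check using the subgroup counts above:

```python
# Counts from the tables above: (survived, died)
A_good, A_poor = (590, 10), (210, 190)
B_good, B_poor = (870, 30), (30, 70)

def rate(survived, died):
    return survived / (survived + died)

# Overall: Hospital B looks better...
overall_A = rate(A_good[0] + A_poor[0], A_good[1] + A_poor[1])  # 0.80
overall_B = rate(B_good[0] + B_poor[0], B_good[1] + B_poor[1])  # 0.90

# ...but Hospital A is better within BOTH subgroups
assert overall_B > overall_A
assert rate(*A_good) > rate(*B_good)   # 98.3% vs 96.7%
assert rate(*A_poor) > rate(*B_poor)   # 52.5% vs 30%
```

The driver of the reversal is the mix of patients: most of Hospital B's patients arrive in good condition, while a large share of Hospital A's arrive in poor condition.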
Simpson’s paradox has appeared in data involving:
1. College admissions
2. Medical data
3. Sports statistics
4. Court convictions
and many more.
A 2008 article from Newscientist.com ran the headline “Breakfast Cereals Boost the Chances of Conceiving Boys” (https://www.newscientist.com/article/dn13754-breakfast-cereals-boost-chances-of-conceiving-boys/)
Note: In addition to statistical fallacies, this article contains problematic assumptions about gender, among other topics. The purpose of examining this article is to illuminate the real and troubling ways in which misuse and misrepresentation of data can be harmful, especially when combined with problematic assumptions in society.
The article claims that women who eat breakfast cereal before becoming pregnant are significantly more likely to conceive boys. The researchers kept track of 133 foods and, for each food, tested whether there was a difference in the proportion conceiving boys between women who ate the food and women who did not. Of all the foods, only breakfast cereal resulted in a p-value less than 0.01.
Should we conclude that women who eat cereal are more likely to conceive boys? Explain.
The breakfast cereal conclusion demonstrates a pitfall known as multiple testing error.
In fact, if we test 100 different foods, then we would expect 5 of them to yield a p-value less than 0.05 and 1 to yield a p-value less than 0.01 just by chance.
In order to correct for this, researchers should use a lower cutoff value (level of significance) in order to reject the null hypothesis.
One way to correct for multiple testing error is the Bonferroni correction. Instead of using 0.05, use \(\frac{0.05}{\# \text{ tests}}\)
Example: In the breakfast cereal example, only reject the null hypothesis if the p-value is less than \(\frac{0.05}{133}\approx 0.000376\)
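The scale of the problem is easy to quantify: if all 133 nulls are true, a "significant" result at the 0.01 level is more likely than not. A quick Python check (assuming the tests are independent, which is a simplification):

```python
# Chance that at least one of 133 true-null tests yields p < 0.01
p_any = 1 - 0.99**133          # about 0.74

# Expected number of false positives at the 0.01 level
expected_hits = 133 * 0.01     # about 1.3

# Bonferroni-corrected cutoff for an overall 0.05 level
bonferroni = 0.05 / 133        # about 0.000376

print(round(p_any, 2), round(expected_hits, 2), round(bonferroni, 6))
```

So finding exactly one food with p < 0.01 is very close to what pure chance predicts.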
We consider data on a random sample of 80 babies born in North Carolina in 2004. Thirty were born to mothers who were smokers, while fifty were born to mothers who were nonsmokers.
We are interested in studying whether there is evidence of a difference in average birthweight between babies born to smokers and nonsmokers.
| habit | Mean_Weight | SD | n |
|---|---|---|---|
| nonsmoker | 7.039200 | 1.709388 | 50 |
| smoker | 6.616333 | 1.106418 | 30 |
Let \(\mu_1\) represent mean birthweight for babies whose mothers are nonsmokers.
Let \(\mu_2\) represent mean birthweight for babies whose mothers are smokers.
Null Hypothesis: There is no difference between mean birthweight for babies with mothers who smoke, compared to babies with mothers who do not. (\(\mu_1=\mu_2\))
Alternative Hypothesis: There is a difference between mean birthweight for babies with mothers who smoke, compared to babies with mothers who do not. (\(\mu_1\neq\mu_2\))
Since the samples are not too small (both \(\geq 30\)) and not heavily skewed, the t-distribution is appropriate.
t.test(data=NCBirths, weight~habit, alternative="two.sided", conf.level=0.95)
##
## Welch Two Sample t-test
##
## data: weight by habit
## t = 1.3423, df = 77.486, p-value = 0.1834
## alternative hypothesis: true difference in means between group nonsmoker and group smoker is not equal to 0
## 95 percent confidence interval:
## -0.2043806 1.0501140
## sample estimates:
## mean in group nonsmoker mean in group smoker
## 7.039200 6.616333
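The t-statistic and degrees of freedom in the Welch output can be recovered from the summary table. A hand computation in Python (values from the table above):

```python
import math

m1, s1, n1 = 7.039200, 1.709388, 50   # nonsmoker
m2, s2, n2 = 6.616333, 1.106418, 30   # smoker

v1, v2 = s1**2 / n1, s2**2 / n2
t = (m1 - m2) / math.sqrt(v1 + v2)                   # ~1.34
# Welch-Satterthwaite approximation for the degrees of freedom
df = (v1 + v2)**2 / (v1**2/(n1-1) + v2**2/(n2-1))    # ~77.5
print(round(t, 2), round(df, 1))
```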
Do the data provide evidence of a difference in birthweights for babies born to smokers, compared to nonsmokers?
Many studies have shown that a mother’s smoking puts a baby at risk of low birthweight. Do our results contradict this research? Why or why not?
Notice that we observed a difference of about 0.4 lbs. in mean birthweight. A 0.4 lb. difference is considerable for baby weights.
It would be highly inappropriate to say that our data suggest that there is no difference in birthweights for babies of smokers, compared to nonsmokers.
Our large p-value is due to the fact that our sample size is too small. It does NOT suggest that there is no difference.
This is yet another example of why we should never “accept the null hypothesis” or say that our data “support the null hypothesis.”
In fact, this sample of 80 babies is part of a larger dataset, consisting of 1,000 babies born in NC in 2004. When we consider the full dataset, notice that the difference between the groups is similar, but the p-value is much smaller.
t.test(data=ncbirths, weight~habit, alternative="two.sided", conf.level=0.95)
##
## Welch Two Sample t-test
##
## data: weight by habit
## t = 2.359, df = 171.32, p-value = 0.01945
## alternative hypothesis: true difference in means between group nonsmoker and group smoker is not equal to 0
## 95 percent confidence interval:
## 0.05151165 0.57957328
## sample estimates:
## mean in group nonsmoker mean in group smoker
## 7.144273 6.828730
A traveler lives in New York and wants to fly to Chicago. They consider flying out of two New York airports:
We have data on the times of flights from both airports to Chicago’s O’Hare airport from 2013 (more than 14,000 flights).
Assuming these flights represent a random sample of all flights from these airports to Chicago, consider how the traveler might use this information to decide which airport to fly out of.
| origin | Mean_Airtime | SD | n |
|---|---|---|---|
| EWR | 113.2603 | 9.987122 | 5828 |
| LGA | 115.7998 | 9.865270 | 8507 |
Let \(\mu_1\) represent mean airtime for flights out of Newark (EWR).
Let \(\mu_2\) represent mean airtime for flights out of LaGuardia (LGA).
Null Hypothesis: Mean flight time from LaGuardia to O’Hare is the same as from Newark to O’Hare (\(\mu_1=\mu_2\))
Alternative Hypothesis: Mean flight time to O’Hare differs between the two New York airports (\(\mu_1\neq\mu_2\))
Because sample sizes are large (much greater than 30), we can use the t-distribution.
t.test(data=Flights1, air_time~origin, alternative="two.sided", conf.level=0.95)
##
## Welch Two Sample t-test
##
## data: air_time by origin
## t = -15.028, df = 12419, p-value < 0.00000000000000022
## alternative hypothesis: true difference in means between group EWR and group LGA is not equal to 0
## 95 percent confidence interval:
## -2.870747 -2.208287
## sample estimates:
## mean in group EWR mean in group LGA
## 113.2603 115.7998
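Once again the R output follows from the summary statistics; with sample sizes in the thousands the standard error is tiny, so even a 2.5-minute difference produces an enormous t-statistic. A Python check (the normal critical value 1.96 is used in place of the t critical value, which is essentially identical at this df):

```python
import math

m1, s1, n1 = 113.2603, 9.987122, 5828   # EWR
m2, s2, n2 = 115.7998, 9.865270, 8507   # LGA

se = math.sqrt(s1**2 / n1 + s2**2 / n2)
t = (m1 - m2) / se                                    # ~ -15.03
ci = ((m1 - m2) - 1.96 * se, (m1 - m2) + 1.96 * se)   # ~ (-2.87, -2.21)
print(round(t, 2), [round(x, 2) for x in ci])
```

Note that `se` here is about 0.17 minutes: the huge samples pin down the difference very precisely, even though the difference itself is small.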
Do the data provide evidence that mean flight time to O’Hare differs between the two New York airports?
How important would this information be for you when deciding which New York airport to fly out of?
The low p-value gives us strong evidence of a difference in mean flight times between the two New York airports.
We can be 95% confident that the average flight time to Chicago O’Hare is between 2.2 and 2.9 minutes shorter for flights out of Newark than for flights out of LaGuardia.
In reality, this difference is practically meaningless.
The low p-value is due to the very large sample size.
A p-value only tells us part of the story.
- A low p-value tells us the observed difference would have been unlikely to occur by chance
- It does not tell us the size of the difference or whether it is meaningful
- When the sample size is large, even small differences yield small p-values
p-values are only (a small) part of a statistical analysis.
We will now consider situations where we wish to compare two quantitative variables. For example
How is the price of a car related to the amount of time it takes a car to accelerate from 0 to 60 miles per hour?
The variable we are interested in predicting is called the response variable and is plotted on the y-axis. (price)
The variable we are using to make predictions is called the explanatory variable (or predictor variable) and is plotted on the x-axis. (acceleration time)
Such problems can be investigated using linear regression models.
When describing relationships between quantitative variables, we should consider:
The correlation coefficient measures the strength and direction of the linear relationship between two quantitative variables.
cor(SmallCars$LowPrice, SmallCars$Acc060)
## [1] -0.8230937
There is a fairly strong negative linear association between acceleration time and price.
A correlation between two variables does not imply there is a causal relationship.
Correlation only describes the strength of a linear relationship. \(r=0\) means there is no linear association, but there could still be a nonlinear relationship.
Example:
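A classic example of the second point: a perfect parabola has \(r=0\) even though the relationship is deterministic. A small Python check, computing Pearson's r from its definition:

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient, computed from its definition."""
    mx, my = sum(x)/len(x), sum(y)/len(y)
    sxy = sum((a - mx)*(b - my) for a, b in zip(x, y))
    sxx = sum((a - mx)**2 for a in x)
    syy = sum((b - my)**2 for b in y)
    return sxy / math.sqrt(sxx * syy)

x = [-2, -1, 0, 1, 2]
y = [v**2 for v in x]        # perfect nonlinear (quadratic) relationship
print(pearson_r(x, y))       # 0.0 -- no LINEAR association
```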
The Corr/Regression applet illustrates how the line of best fit is chosen.
A residual is the difference between the observed response and predicted response.
Residual = Observed - Predicted
The line of best fit is determined by minimizing the sum of the squared residuals.
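Minimizing the sum of squared residuals has a closed-form solution: \(b_1 = \sum(x_i-\bar{x})(y_i-\bar{y})/\sum(x_i-\bar{x})^2\) and \(b_0 = \bar{y}-b_1\bar{x}\). A minimal Python sketch of this least-squares fit (with made-up illustrative data):

```python
def least_squares(x, y):
    """Intercept and slope minimizing the sum of squared residuals."""
    mx, my = sum(x)/len(x), sum(y)/len(y)
    b1 = sum((a - mx)*(b - my) for a, b in zip(x, y)) / sum((a - mx)**2 for a in x)
    b0 = my - b1 * mx
    return b0, b1

x, y = [1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8]
b0, b1 = least_squares(x, y)
residuals = [obs - (b0 + b1 * xi) for xi, obs in zip(x, y)]  # observed - predicted
print(round(b0, 2), round(b1, 2))   # 0.15 1.94
```

A useful property of the least-squares line: the residuals always sum to zero.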
The regression equation has the form \(y=b_0+b_1x\).
\(b_0\) represents the y-intercept of the regression line. This is the expected value of the response variable when the explanatory variable is 0.
In this case, that would be the expected price of a car that can accelerate from 0 to 60 mph in 0 seconds.
\(b_1\) represents the slope of the regression line. This is the expected change in the response variable when the explanatory variable increases by 1 unit.
In this case, that would be expected change in price for each additional second it takes a car to accelerate from 0 to 60 mph.
M <- lm(data=SmallCars, LowPrice~Acc060)
summary(M)
##
## Call:
## lm(formula = LowPrice ~ Acc060, data = SmallCars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.487 -5.921 0.855 3.404 30.141
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 79.3886 5.5566 14.287 < 0.0000000000000002 ***
## Acc060 -6.1535 0.6329 -9.723 0.00000000000125 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.267 on 45 degrees of freedom
## Multiple R-squared: 0.6775, Adjusted R-squared: 0.6703
## F-statistic: 94.53 on 1 and 45 DF, p-value: 0.000000000001246
The regression equation is:
\(\text{Expected Price} = 79.39 - 6.1535 \times \text{Acceleration Time}\)
Interpretation of y-intercept
The expected price of a car that can accelerate from 0 to 60 miles per hour in 0 seconds is 79.39 thousand dollars. This does not make sense in context, so the intercept does not have a meaningful interpretation in this situation.
Interpretation of slope The price of a car is expected to decrease by 6.15 thousand dollars for each additional second it takes to accelerate from 0 to 60 mph.
Coefficient of Determination The value “Multiple R-squared” = 0.6775 means that 67.75% of the variation in price is explained by our model based on acceleration time.
The regression equation is:
\(\text{Expected Price} = 79.39 - 6.1535 \times \text{Acceleration Time}\)
Examples:
The expected price of a small car that takes 8 seconds to accelerate from 0 to 60 mph is \(79.39-6.1535\times 8 = 30.162\) thousand dollars.
The expected price of a small car that takes 10 seconds to accelerate from 0 to 60 mph is \(79.39-6.1535\times 10 = 17.855\) thousand dollars.
We should not attempt to predict the price of a small car that takes 15 seconds to accelerate, since 15 lies outside the range of our data. This is called extrapolation, which is dangerous.
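One way to guard against extrapolation in code is to refuse predictions outside the observed range of the explanatory variable. A sketch using the fitted coefficients from the output above; the 4–12 second range is a hypothetical stand-in for the actual range of `Acc060` in the dataset:

```python
def predict_price(acc060, lo=4.0, hi=12.0):
    """Predicted price (thousands of dollars) from the fitted line,
    refusing to extrapolate beyond the observed acceleration times.
    The [lo, hi] bounds are illustrative, not taken from the data."""
    if not lo <= acc060 <= hi:
        raise ValueError(f"{acc060} s is outside the observed range [{lo}, {hi}]")
    return 79.39 - 6.1535 * acc060

print(round(predict_price(8), 3))    # 30.162
print(round(predict_price(10), 3))   # 17.855
# predict_price(15) raises ValueError -- extrapolation
```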
We have data on the height (in inches) and footlength (in cm) for a sample of 20 people.
Measurements for the first 6 people are shown.
## footlength height
## 2 32 74
## 3 24 66
## 4 29 77
## 5 30 67
## 6 24 56
## 7 26 65
Slope:
## [1] 1.033259
On average, a 1 cm increase in footlength is associated with a 1.03 inch increase in height.
We need to determine whether we could have plausibly obtained a slope as extreme as 1.03 by chance, when there was really no relationship.
Null Hypothesis: There is no relationship between height and footlength (slope=0).
Alternative Hypothesis: There is a relationship between height and footlength. (slope\(\neq\) 0)
How might we simulate a dataset where there is no relationship between height and footlength?
| | footlength | height | ShuffledHeight |
|---|---|---|---|
| 2 | 32 | 74 | 77 |
| 3 | 24 | 66 | 65 |
| 4 | 29 | 77 | 71 |
| 5 | 30 | 67 | 66 |
| 6 | 24 | 56 | 65 |
| 7 | 26 | 65 | 64 |
Slope:
## [1] 0.01330377
This Rossman-Chance applet can be used to repeatedly simulate shuffled data.
Proportion of simulations with slope exceeding 1.03:
## [1] 0.0008
The probability of observing a slope as extreme as 1.03 if there is really no relationship between height and footlength is 0.0008.
There is strong evidence of a relationship between height and footlength.
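The shuffling procedure can be sketched in a few lines. This version uses only the six rows displayed above, purely to show the mechanics; the actual analysis used all 20 people, so these numbers will not match the applet's:

```python
import random

# The six rows shown above (full dataset has 20 people)
footlength = [32, 24, 29, 30, 24, 26]
height = [74, 66, 77, 67, 56, 65]

def slope(x, y):
    mx, my = sum(x)/len(x), sum(y)/len(y)
    return sum((a-mx)*(b-my) for a, b in zip(x, y)) / sum((a-mx)**2 for a in x)

observed = slope(footlength, height)   # ~1.65 on these six rows
random.seed(1)
n_sims, count = 10_000, 0
for _ in range(n_sims):
    shuffled = random.sample(height, len(height))  # break any real association
    if abs(slope(footlength, shuffled)) >= abs(observed):
        count += 1
p_value = count / n_sims   # share of shuffles with a slope this extreme
print(round(observed, 2), p_value)
```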
Calculate and interpret test statistics and p-values for the slope of a regression line, using R-output.
Calculate and interpret a confidence interval for the slope of a regression line, using R-output.
We saw that the simulation-based null distribution for the slope is symmetric and roughly bell-shaped, so we can approximate it using a t-distribution.
We calculate a t-statistic using the formula
\(t=\frac{\text{slope}}{\text{Standard Error(Slope)}}\)
We will get these quantities from R.
M <- lm(height~footlength, data=FootHeight)
summary(M)
##
## Call:
## lm(formula = height ~ footlength, data = FootHeight)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.1003 -2.2251 -0.7833 2.1330 8.7334
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.3021 6.9050 5.547 2.89e-05 ***
## footlength 1.0333 0.2406 4.294 0.000437 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.613 on 18 degrees of freedom
## Multiple R-squared: 0.506, Adjusted R-squared: 0.4786
## F-statistic: 18.44 on 1 and 18 DF, p-value: 0.0004367
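The t-statistic in the `footlength` row is simply the estimate divided by its standard error. Checking with the printed values:

```python
estimate, se = 1.0333, 0.2406   # from the summary(M) output
t = estimate / se
print(round(t, 2))   # 4.29, matching R's 4.294 up to rounding
```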
We test whether there is evidence of a relationship between footlength and height.
Null hypothesis: There is no relationship between height and footlength. (slope=0)
Alternative hypothesis: There is a relationship between height and footlength. (slope\(\neq 0\))
t-statistic: 4.294
p-value: 0.000437
The p-value tells us the probability of observing a slope as extreme as 1.03 if there is really no relationship between height and footlength.
Since this p-value is extremely low, we have strong evidence of a relationship between height and footlength.
confint(M, "footlength")
## 2.5 % 97.5 %
## footlength 0.5277435 1.538775
We can be 95% confident that a 1 cm. increase in the length of a person’s foot is associated with an increase in height between 0.53 and 1.54 inches.
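The interval from `confint` is the estimate plus or minus a t critical value times the standard error. With 18 degrees of freedom, the 97.5th percentile of the t-distribution is about 2.101 (hard-coded here, since Python's standard library has no t quantile function):

```python
estimate, se = 1.0333, 0.2406   # from the summary(M) output
t_crit = 2.101                  # approximately qt(0.975, df=18)
lower = estimate - t_crit * se
upper = estimate + t_crit * se
print(round(lower, 3), round(upper, 3))   # ~0.528 and ~1.539
```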
predict(M, data.frame(footlength=32))
## 1
## 71.36641
predict(M, data.frame(footlength=50))
## 1
## 89.96508
We should be careful to only make predictions inside the range of values for the explanatory variable that are available in the dataset. Going outside this range is called extrapolation and can lead to nonsensical predictions.
For a response variable (Y) and explanatory variable (X):
A confidence interval tells us a reasonable range for the average value of Y for all observations with the given value of X.
- Example: Estimate the average height of all people with footlength 32 cm.
A prediction interval tells us a reasonable range for the value of Y for an individual with the given value of X.
- Example: Estimate the height of my neighbor, who I know has a footlength of 32 cm.
A prediction interval must account for both variability associated with the selection of the sample and variability between individuals. A confidence interval only needs to account for variability associated with the selection of the sample.
predict(M, newdata=data.frame(footlength=32), conf.level=0.95, interval="confidence")
## fit lwr upr
## 1 71.36641 68.91453 73.81829
We are 95% confident that the average height for all people with footlength 32 cm. is between 68.9 and 73.8 inches tall.
predict(M, newdata=data.frame(footlength=32), conf.level=0.95, interval="prediction")
## fit lwr upr
## 1 71.36641 63.3891 79.34372
We are 95% confident that an individual with a footlength of 32 cm. will be between 63.4 and 79.3 in. tall.
It is not always appropriate to use simple linear regression (that is, regression with one explanatory variable) to model the relationship between two quantitative variables.
The following slides illustrate situations where simple linear regression is not appropriate.
A nonlinear pattern indicates that a more complicated model should be used, rather than a simple linear regression model.
A “funnel shape” indicates a lack of constant variability, which will throw off confidence intervals and hypothesis tests associated with a simple linear regression model.
An outlier can throw off the general trend in the data
It is also inappropriate to use simple linear regression if certain observations are more highly correlated with one another than others.
For example:
All of these require more complicated models that account for correlation using spatial or time-series structure.
Exam 1 vs Exam 2 scores for intro stat students at another college
What is the relationship between scores on the two exams?
How many of the 6 students who scored below 70 on Exam 1 improved their scores on Exam 2?
How many of the 7 students who scored above 90 improved on Exam 2?
A low score on an exam is often the result of both poor preparation and bad luck.
A high score often results from both good preparation and good luck.
While changes in study habits and preparation likely explain some improvement in low scores, we would also expect the lowest performers to improve simply because of better luck.
Likewise, some of the highest performers may simply not be as lucky on exam 2, so a small dropoff should not be interpreted as weaker understanding of the exam material.
This simulation shows that the lowest scorers often improve, while the highest scorers often drop off, by chance alone.
This phenomenon is called the regression effect.
Wins by NFL teams in 2017 and 2018
A 1973 article by Kahneman, D. and Tversky, A., “On the Psychology of Prediction,” Psych. Rev. 80:237-251, describes an instance of the regression effect in the training of Israeli air force pilots.
Trainees were praised after performing well and criticized after performing badly. The flight instructors observed that “high praise for good execution of complex maneuvers typically results in a decrement of performance on the next try.”
Kahneman and Tversky write that:
“We normally reinforce others when their behavior is good and punish them when their behavior is bad. By regression alone, therefore, they [the trainees] are most likely to improve after being punished and most likely to deteriorate after being rewarded. Consequently, we are exposed to a lifetime schedule in which we are most often rewarded for punishing others, and punished for rewarding.”
We’ll now look at a dataset containing education data on all 50 states. Among the variables are average SAT score, average teacher salary, and fraction of students who took the SAT.
head(SAT)
## state expend ratio salary frac verbal math sat
## 1 Alabama 4.405 17.2 31.144 8 491 538 1029
## 2 Alaska 8.963 17.6 47.951 47 445 489 934
## 3 Arizona 4.778 19.3 32.175 27 448 496 944
## 4 Arkansas 4.459 17.1 28.934 6 482 523 1005
## 5 California 4.992 24.0 41.078 45 417 485 902
## 6 Colorado 5.443 18.4 34.571 29 462 518 980
The plot displays average SAT score against average teacher salary for all 50 US states.
What conclusion do you draw from the plot?
Are these results surprising?
M <- lm(data=SAT, sat~salary)
summary(M)
##
## Call:
## lm(formula = sat ~ salary, data = SAT)
##
## Residuals:
## Min 1Q Median 3Q Max
## -147.125 -45.354 4.073 42.193 125.279
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1158.859 57.659 20.098 < 2e-16 ***
## salary -5.540 1.632 -3.394 0.00139 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 67.89 on 48 degrees of freedom
## Multiple R-squared: 0.1935, Adjusted R-squared: 0.1767
## F-statistic: 11.52 on 1 and 48 DF, p-value: 0.001391
Let’s break the data down by the percentage of students who take the SAT.
Low = 0%-22%
Medium = 22-49%
High = 49-81%
SAT <- mutate(SAT, fracgrp = cut(frac,
breaks=c(0, 22, 49, 81),
labels=c("low", "medium", "high")))
Now what conclusions do you draw from the plots?
M <- lm(data=SAT, sat~salary+frac)
summary(M)
##
## Call:
## lm(formula = sat ~ salary + frac, data = SAT)
##
## Residuals:
## Min 1Q Median 3Q Max
## -78.313 -26.731 3.168 18.951 75.590
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 987.9005 31.8775 30.991 <2e-16 ***
## salary 2.1804 1.0291 2.119 0.0394 *
## frac -2.7787 0.2285 -12.163 4e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 33.69 on 47 degrees of freedom
## Multiple R-squared: 0.8056, Adjusted R-squared: 0.7973
## F-statistic: 97.36 on 2 and 47 DF, p-value: < 2.2e-16
\(\text{Expected Average SAT} =\) \(987.9 + 2.18 \times \text{Average Teacher Salary} - 2.78 \times \text{Percentage taking SAT}\)
If a state has an average teacher salary of 40 thousand dollars and 30% of students take the SAT, the expected average SAT score would be
\(987.9 + 2.18 \times 40 -2.78 \times 30 = 991.7\)
predict(M, newdata=data.frame(salary=40, frac=30))
## 1
## 991.7554
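The same prediction by hand, using the rounded coefficients (the small discrepancy from R's `predict` comes from rounding the coefficients):

```python
salary, frac = 40, 30   # salary in thousands of dollars, frac in percent
predicted_sat = 987.9 + 2.18 * salary - 2.78 * frac
print(round(predicted_sat, 1))   # 991.7
```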
On average, a $1,000 increase in average teacher salary is associated with a 2.18 point increase in average SAT score, assuming no change in the fraction of students taking the SAT.
On average, a 1% increase in percentage of students taking the SAT is associated with a 2.78 point decrease in average SAT score assuming average teacher salary is held constant.
A 1992 study by Chase and Dummer asked a random sample of students in grades 4-6 in the state of Michigan which they thought was most important: grades, being popular, or playing sports.
Questions of interest:
Null Hypothesis: There are no differences in preferences for all 4th, 5th, and 6th graders.
Alternative Hypothesis: There are differences in preferences between grade levels.
Counts:
| | Grade 4 | Grade 5 | Grade 6 |
|---|---|---|---|
| Grades | 49 | 50 | 69 |
| Popular | 24 | 36 | 38 |
| Sports | 19 | 22 | 28 |
Percentages:
| | Grade 4 | Grade 5 | Grade 6 |
|---|---|---|---|
| Grades | 53 | 46 | 51 |
| Popular | 26 | 33 | 28 |
| Sports | 21 | 20 | 21 |
A \(\chi^2\) statistic tells us how “different” proportions are across groups. The larger the \(\chi^2\) value, the stronger the evidence of differences.
The \(\chi^2\) distribution is appropriate when each cell in the table has at least 5 observations.
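The statistic compares each observed count to the count expected if preferences were identical across grades: \(\chi^2=\sum (O-E)^2/E\), where \(E = (\text{row total}\times\text{column total})/\text{grand total}\). Recomputing it by hand in Python for the grade-level counts above:

```python
# Observed counts: rows = Grades, Popular, Sports; columns = grades 4, 5, 6
obs = [[49, 50, 69],
       [24, 36, 38],
       [19, 22, 28]]

row_tot = [sum(r) for r in obs]
col_tot = [sum(c) for c in zip(*obs)]
grand = sum(row_tot)

chi2 = sum((obs[i][j] - row_tot[i]*col_tot[j]/grand)**2
           / (row_tot[i]*col_tot[j]/grand)
           for i in range(3) for j in range(3))
df = (3 - 1) * (3 - 1)
print(round(chi2, 2), df)   # ~1.51 on 4 df, matching the R output
```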
chisq.test(T)
##
## Pearson's Chi-squared test
##
## data: T
## X-squared = 1.5126, df = 4, p-value = 0.8244
It would not be at all unusual to observe differences between grade levels as extreme as those in the table if there were really no differences at all. There is no evidence of differences between grade levels.
Null Hypothesis: There are no differences in preferences between students in rural, suburban, and urban settings.
Alternative Hypothesis: There are differences in preferences between settings.
Counts:
| | Rural | Suburban | Urban |
|---|---|---|---|
| Grades | 57 | 87 | 24 |
| Popular | 50 | 42 | 6 |
| Sports | 42 | 22 | 5 |
Percentages:
| | Rural | Suburban | Urban |
|---|---|---|---|
| Grades | 38 | 58 | 69 |
| Popular | 34 | 28 | 17 |
| Sports | 28 | 15 | 14 |
chisq.test(T2)
##
## Pearson's Chi-squared test
##
## data: T2
## X-squared = 18.564, df = 4, p-value = 0.000957
It would be very surprising to observe differences between rural, suburban, and urban settings as extreme as we saw in the table if there were really no differences at all. There is strong evidence of differences between settings.
Researchers are interested in studying how being exposed to light at night impacts weight gain in mice. In a study, 27 mice were assigned to one of three light conditions:
After 3 weeks, researchers recorded the body mass gain (in grams) in the mice.
Null Hypothesis: The mean weight gain is the same for each light/dark setting, considering all mice.
Alternative Hypothesis: The mean weight gain is different for at least one of the light/dark settings.
These hypotheses can be tested using a procedure called ANalysis Of VAriance (ANOVA).
| Light | mean_Gain | sd | n |
|---|---|---|---|
| DM | 7.85900 | 3.009291 | 10 |
| LD | 5.92625 | 1.899420 | 8 |
| LL | 11.01000 | 2.623985 | 9 |
An F-test compares the amount of variability between groups to the amount of variability within groups.
| Scenario 1 | Scenario 2 | |
|---|---|---|
| variation between groups | High | Low |
| variation within groups | Low | High |
| F Statistic | Large | Small |
| Result | Evidence of Group Differences | No evidence of differences |
A <- aov(data=Mice, BMGain~Light)
summary(A)
## Df Sum Sq Mean Sq F value Pr(>F)
## Light 2 113.1 56.54 8.385 0.00173 **
## Residuals 24 161.8 6.74
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
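The F statistic in the table is the ratio of the two mean squares, each of which is its sum of squares divided by its degrees of freedom. Checking with the printed (rounded) values, so a small discrepancy from R's 8.385 is expected:

```python
ms_between = 113.1 / 2    # Sum Sq / Df for Light
ms_within = 161.8 / 24    # Sum Sq / Df for Residuals
F = ms_between / ms_within
print(round(F, 2))        # ~8.39 (R reports 8.385 from unrounded sums of squares)
```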
We have evidence that there are differences in mean body mass gain in mice between the different light/dark settings.
We don’t yet know which light/dark settings differ significantly from one another.
We use pairwise t-tests to compare the three groups to one another.
pairwise.t.test(Mice$BMGain, Mice$Light, p.adj="none")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: Mice$BMGain and Mice$Light
##
## DM LD
## LD 0.12972 -
## LL 0.01431 0.00049
##
## P value adjustment method: none
Bonferroni correction:
Since we are performing 3 comparisons, we should only conclude that there are differences between groups if the p-value is less than \(0.05/3\approx 0.0167\) (for an overall 0.05 cutoff).
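Applying that cutoff to the unadjusted pairwise p-values from the output above:

```python
# Unadjusted p-values from pairwise.t.test
pvals = {"LD vs DM": 0.12972, "LL vs DM": 0.01431, "LL vs LD": 0.00049}
cutoff = 0.05 / 3   # Bonferroni: three comparisons, overall 0.05 level

significant = {pair for pair, p in pvals.items() if p < cutoff}
print(sorted(significant))   # only the comparisons involving LL
```

So the light-at-night (LL) group differs significantly from both other groups, while LD and DM do not differ significantly from each other.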